Skip to content

Conversation

normen662
Copy link
Contributor

@normen662 normen662 commented Sep 16, 2025

This PR implements a HNSW (Hierarchical Navigable Small World) structure as an part of fdb-extension according to https://arxiv.org/pdf/1603.09320.

  • the implementation is a basic structure that lives inside of a subspace in fdb
  • currently we store and retrieve vectors of class Vector that holds component data using Halfs which are FP16 encoded.
  • we currently only support insertion, batch insertion and retrieval of data via approximative nearest neighbor searches
  • we currently support the following options:
    • M the optimal number of connections between nodes within a layer of the HNSW
    • MMax the maximum number of connections between nodes within a layer of the HNSW
    • MMax0 the maximum number of connections between nodes within layer 0 of the HNSW
    • efConstruction the size of the search queue during thorough (non-greedy) searches during insertion
    • extendCandidates a boolean indicating if we should extend the candidates we can connect to to by also searching within the neighbors of already found candidates
    • keepPrunedConnections a boolean indicating whether under certain circumstances discarded candidates to connect to are reevaluated
  • deletes are not implemented (yet)
  • tests include basic testing using the SIFTsmall data set that is comprised of 10k vectors, 100 queries

@normen662 normen662 force-pushed the hnsw branch 2 times, most recently from 8a3c53c to 87c5674 Compare September 19, 2025 10:43
@normen662 normen662 added the enhancement New feature or request label Sep 19, 2025
@normen662 normen662 force-pushed the hnsw branch 2 times, most recently from fde3d47 to 597d08b Compare September 23, 2025 16:42
Copy link
Collaborator

@alecgrieser alecgrieser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a first pass of the vector logic, that is, the Vector class, the Metric and Metrics classes, and the serialization logic in StorageAdapter, along with the attendant tests. In theory, we could take these separately, which might allow us to think more about how we'd expose these through the SQL type system above

grpc = "1.64.1"
grpc-commonProtos = "2.37.0"
guava = "33.3.1-jre"
half4j = "0.0.2"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is probably somewhat of an open question as to whether we want to add a new dependency to support fp16 encoding or not. It is certainly a bit of a shame that all consumers would need to take on the dependency, even if they do not end up using the HNSW index.

I could see the argument that we need yet another subproject here for vector indexes, and that's where we put the new code, like was done with fdb-record-layer-lucene. Then projects that don't use the new vector sub-component don't need to worry about this.

I did also take a quick look, and it actually seems like writing our own fp32 to fp16 utility wouldn't be that hard. In the common case, it's just a matter of truncating the mantissa and re-encoding the exponent. There are some truncation edge cases (e.g., the numbers can blow up to infinity for a lot of exponent values) that aren't that bad, and then the only "hard" case is making sure that you can convert from a normal 32 (or 64) bit value to a sub-normal 16-bit value. But even that isn't too bad. I have some prototype code locally that could theoretically do that, using java short values as the storage for 16-bit values. Getting the logic to go back and forth is actually not too bad; if we actually want to be able to do fp16 math on those shorts, that's a little more intense, though actually maybe not too too bad if we absolutely needed to.

appender.console.layout.pattern = %d [%level] %logger{1.} - %m %X%n%ex{full}

rootLogger.level = debug
rootLogger.level = info
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for this? Should we revert it?

*/
public abstract class Vector {
@Nonnull
final double[] data;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason the base Vector class wraps a double vector? Rather than, say, a FloatVector wrapping a float[]?

* Returns the number of elements in the vector.
* @return the number of elements
*/
public int size() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: Should this be dimension rather than size? I don't think I actually have an opinion here, but it occurs to me as a decision we could make

* data type conversions and raw data representation.
*/
public abstract class Vector {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be the answer is "no", but is there a better name than vector? The main concern is that it conflicts with the abstract data type name. But it also appears to be the name that other systems use for this. The only alternative name I could think of would be something like FloatingPointVector or like CartesianVector or something, neither of which I'm all that happy with

Comment on lines +29 to +33
private final Metric.ManhattanMetric manhattanMetric = new Metric.ManhattanMetric();
private final Metric.EuclideanMetric euclideanMetric = new Metric.EuclideanMetric();
private final Metric.EuclideanSquareMetric euclideanSquareMetric = new Metric.EuclideanSquareMetric();
private final Metric.CosineMetric cosineMetric = new Metric.CosineMetric();
private Metric.DotProductMetric dotProductMetric;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

  1. Should this reference the Metric objects from the Metrics enum? It seems weird to have that enum but still create instances of the metric implementations on their own
  2. Why is dotProductMetric handled differently from all of the other metrics

}

@Test
public void manhattanMetricDistanceWithIdenticalVectorsShouldReturnZeroTest() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be parameterized on the Metric type? I believe all of the metrics should return zero on identical vectors except for dot product. The EnumSource argument provider can be set up to exclude values, which we could do to skip that particular one

}

@Test
public void euclideanMetricDistanceWithDifferentPositiveVectorsShouldReturnCorrectDistanceTest() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This correct distance metric could also be made generic on the different types, though it would require a more involved arguments provider. Probably something like:

static Stream<Arguments> metricAndExpectedDistance() {
    // Distance between (1.0, 2.0) and (4.0, 6.0)
    return Stream.of(
        Arguments.of(Metrics.MANHATTAN, 7.0), // |1 - 4| + |2 - 6| = 7
        Arguments.of(Metrics.EUCLIDEAN, 5.0),  // sqrt((1-4)^2 + (2-6)^2) = sqrt(9 + 16) = 5.0
        Arguments.of(Metrics.SQUARE_EUCLIDEAN, 25.0), // (1-4)^2 + (2-6)^2 = 9 + 16 = 25
        // etc
    );
}


import static org.junit.jupiter.api.Assertions.assertEquals;

public class MetricTest {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also probably add tests of the comparative distance function, to be sure. I could also see us adding a test of the triangle inequality, for the two that support it.

import java.util.stream.LongStream;
import java.util.stream.Stream;

public class VectorTest {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would probably be a good spot to add a few more unit tests of the Vector class, including things like the hash code and converting between HalfVector and DoubleVector

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants